1 Introduction

It is nowadays common to build databases integrating information from multiple, autonomous,
distributed data sources. The problem of data integration is nevertheless very complex. In
this paper, we consider a specific issue arising in data integration: how to obtain reliable, consistent
information from inconsistent databases, that is, databases that do not have to satisfy given integrity
constraints. Such databases occur in a natural way in data integration, since there is typically no 
global monitor that could guarantee that the integrated database satisfies the constraints. The 
data sources are independent and even if they separately satisfy the constraints, the integrated 
database may fail to do so. For example, different data sources may contain different, locally unique 
addresses for the same person, leading to the violation of the global uniqueness constraint for people's 
addresses. Inconsistent databases also occur in other contexts. For instance, integrity constraints
may fail to be enforced for efficiency reasons, or because the inconsistencies are temporary.
Alternatively, there may not be enough information to resolve inconsistencies, while the database
has to remain in use for real-time decision support.

To formalize the notion of consistent information obtained from a (possibly inconsistent) database 
in response to a user query, we proposed in 0| the notion of a consistent query answer. A consistent 
answer is, intuitively, true regardless of the way the database is fixed to remove constraint violations. 
Thus answer consistency serves as an indication of its reliability. The different ways of fixing an 
inconsistent database are formalized using the notion of repair: another database that is consistent
and minimally differs from the original database.

Example 1 Consider the following instance of a relation Person

Name     City       Street
Brown    Amherst    115 Klein
Brown    Amherst    120 Maple
Green    Clarence   4000 Transit

and the functional dependency $\mathit{Name} \rightarrow \mathit{City}\ \mathit{Street}$. Clearly, the
above instance does not satisfy the dependency. There are two repairs: one is obtained by removing
the first tuple, the other by removing the second. The consistent answer to the query
$\mathit{Person}(n, c, s)$ is just the tuple (Green, Clarence, 4000 Transit). On the other hand, the
query $\exists s[\mathit{Person}(n, c, s)]$ has two consistent answers: (Brown, Amherst) and
(Green, Clarence). Similarly, the query

$\mathit{Person}(\text{Brown}, \text{Amherst}, \text{115 Klein}) \vee \mathit{Person}(\text{Brown}, \text{Amherst}, \text{120 Maple})$

has true as the consistent answer. Notice that for the last two queries the approach based on removing 
all inconsistent tuples and evaluating the original query using the remaining tuples gives different, 
less informative results. 
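
Example 1 can be checked mechanically. The following is a minimal Python sketch, not part of the
original development. It assumes that repairs are the maximal consistent subsets of the instance
(which holds for FDs), and the helper names `satisfies_fd`, `repairs` and `consistent_answers` are
illustrative only. The enumeration is exponential and is meant for tiny instances.

```python
from itertools import combinations

# Person instance from Example 1, as (Name, City, Street) tuples.
PERSON = {
    ("Brown", "Amherst", "115 Klein"),
    ("Brown", "Amherst", "120 Maple"),
    ("Green", "Clarence", "4000 Transit"),
}


def satisfies_fd(tuples):
    """Name -> City Street: two tuples with the same Name must be identical."""
    return all(s == t for s in tuples for t in tuples if s[0] == t[0])


def repairs(instance):
    """Brute force: all consistent subsets that are maximal w.r.t. inclusion."""
    rows = sorted(instance)
    consistent = [
        frozenset(subset)
        for size in range(len(rows) + 1)
        for subset in combinations(rows, size)
        if satisfies_fd(subset)
    ]
    return [s for s in consistent if not any(s < other for other in consistent)]


def consistent_answers(instance, query):
    """Tuples that `query` returns in every repair of `instance`."""
    return set.intersection(*(set(query(rep)) for rep in repairs(instance)))


# Query Person(n, c, s) and query "exists s. Person(n, c, s)".
all_columns = consistent_answers(PERSON, lambda rep: rep)
name_and_city = consistent_answers(PERSON, lambda rep: {(n, c) for n, c, _ in rep})
print(sorted(all_columns))
print(sorted(name_and_city))
```

Under these assumptions the first query keeps only the Green tuple, while the projection on Name and
City keeps both Brown and Green, matching the answers stated in the example.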

In 0], in addition to a formal definition of a consistent query answer, a computational mechanism 
for obtaining such answers was presented in the context of first-order queries. In §, the same prob- 
lem was studied for scalar aggregation queries. In ||l[ some cases were identified where consistent 
query answers are tractable (in PTIME). In the present paper, we provide a complete classification 
of the computational complexity of computing consistent query answers to first-order queries. We 
consider functional dependencies (FDs) and their generalization: denial constraints. Denial
constraints allow an arbitrary number of literals per constraint and arbitrary built-in predicates. They
also relax the typedness restriction of FDs. Denial constraints are particularly useful for databases 
with interpreted data, e.g., numbers. Their implication problem was studied in 

Example 2 The constraint that no employee can have a salary greater than that of her manager is 
the denial constraint 

$\forall n, s, m, s', m'.\ \neg[\mathit{Emp}(n, s, m) \wedge \mathit{Emp}(m, s', m') \wedge s > s']$.
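
To make the constraint of Example 2 concrete, the sketch below lists the pairs of tuples that jointly
violate it. The instance `EMP` and the function name `violations` are hypothetical and serve only as
an illustration; tuples are read as (name, salary, manager).

```python
def violations(emp):
    """Pairs (employee tuple, manager tuple) with Emp(n, s, m), Emp(m, s', m') and s > s'."""
    return [
        (employee, manager)
        for employee in emp
        for manager in emp
        if employee[2] == manager[0] and employee[1] > manager[1]
    ]


# Hypothetical instance: ann earns more than her manager bob.
EMP = [("ann", 90, "bob"), ("bob", 80, "cat"), ("cat", 120, "cat")]
print(violations(EMP))
```

Each reported pair is a set of tuples that cannot appear together in any consistent database.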

The results of |^ imply that for binary denial constraints consistent answers can be computed 
in PTIME for queries that are conjunctions of literals. In the present paper we strengthen that 
result to arbitrary quantifier-free queries and arbitrary denial constraints. We also identify a class
of restricted existentially quantified queries (consisting of single literals), for which consistent query 
answers can be computed in PTIME. In general, we show how the complexity depends on the type 
of the constraints considered, their number, and the size of the query. Related work is discussed in
depth in ||, Q . Other papers that adopt our notion of consistent query answer include |8[ Q . 

2 Basic Notions 

In this paper we assume that we have a fixed database schema containing only one relation schema
$R$ with the set of attributes $U$. We will denote elements of $U$ by $A, B, \ldots$, subsets of $U$ by
$X, Y, \ldots$, and the union of $X$ and $Y$ by $XY$. We also have two fixed, disjoint infinite database
domains: $D$ (uninterpreted constants) and $N$ (numbers). We assume that elements of the domains
with different names are different. The database instances can be seen as first-order structures that
share the domains $D$ and $N$. Every attribute in $U$ is typed, thus all the instances of $R$ can contain
only elements either of $D$ or of $N$ in a single attribute. Since each instance is finite, it has a finite
active domain which is a subset of $D \cup N$. As usual, we allow the standard built-in predicates over
$N$ ($=, \neq, <, >, \leq, \geq$) that have infinite, fixed extensions.

Integrity constraints are typed, closed first-order formulas over the vocabulary consisting of R 
and the built-in predicates over N . 

Definition 1 Given a database instance $r$ of $R$ and a set of integrity constraints $F$, we say that $r$
is consistent if $r \models F$ in the standard model-theoretic sense; inconsistent otherwise.

We consider the following classes of integrity constraints: 

• denial constraints: formulas of the form
$\forall \bar{x}_1, \ldots, \bar{x}_m.\ \neg[R(\bar{x}_1) \wedge \cdots \wedge R(\bar{x}_m) \wedge \phi(\bar{x}_1, \ldots, \bar{x}_m)]$,
where $\bar{x}_1, \ldots, \bar{x}_m$ are tuples of variables and constants, and $\phi$ is a conjunction of
atomic formulas referring to built-in predicates;

• functional dependencies (FDs) $X \rightarrow Y$ over the set $U$ (key FDs if $X$ is a key of $R$).
Clearly, functional dependencies are a special case of denial constraints.

Definition 2 For the instances $r, r', r''$: $r' \leq_r r''$ if $r - r' \subseteq r - r''$. □

Definition 3 Given a set of integrity constraints $F$ and database instances $r$ and $r'$, we say that $r'$
is a repair of $r$ w.r.t. $F$ if $r' \models F$ and $r'$ is $\leq_r$-minimal in the class of database instances
that satisfy $F$. □

We denote by $\mathit{Repairs}_F(r)$ the set of repairs of $r$ w.r.t. $F$. For any set of denial constraints, all
the repairs are obtained by deleting tuples from the table.

Definition 4 Given a set of integrity constraints $F$ and a database instance $r$, we say that a
(ground) tuple $\bar{t}$ is a consistent answer to a query $Q(\bar{x})$ w.r.t. $F$ in $r$, and we write
$r \models_F Q(\bar{t})$, if for every $r' \in \mathit{Repairs}_F(r)$, $r' \models Q(\bar{t})$. If $Q$ is a sentence, then
true (false) is a consistent answer to $Q$ w.r.t. $F$ in $r$, and we write $r \models_F Q$ ($r \models_F \neg Q$),
if for every $r' \in \mathit{Repairs}_F(r)$, $r' \models Q$ ($r' \not\models Q$). □

3 Data Complexity of Consistent Query Answers 

Assume a class of databases $\mathcal{D}$, a class of queries $\mathcal{C}$ and a class of integrity constraints $\mathcal{IC}$ are
given. We study here the data complexity [ill of consistent query answers, i.e., the complexity of 
(deciding the membership of) the sets $D_{F,\phi} = \{r : r \in \mathcal{D} \wedge r \models_F \phi\}$ for a fixed
sentence $\phi \in \mathcal{C}$ and a fixed finite set $F \in \mathcal{IC}$ of integrity constraints.

Proposition 1 For any set of denial constraints $F$ and sentence $\phi$, $D_{F,\phi}$ is in co-NP.

It is easy to see that even under a single key FD, there may be exponentially many repairs and 
thus the approach to computing consistent query answers by generating and examining all repairs 
is not feasible. 

Example 3 Consider the functional dependency $A \rightarrow B$ and the following family of relation
instances $r_n$, $n > 0$, each of which has $2n$ tuples (represented as columns) and $2^n$ repairs:

r_n
A    a_1  a_1  a_2  a_2  ...  a_n  a_n
B    b_0  b_1  b_0  b_1  ...  b_0  b_1
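
The growth in Example 3 can be observed directly. The sketch below builds $r_n$ and enumerates its
repairs under the assumption, valid for a single FD $A \rightarrow B$ over a two-attribute relation,
that a repair keeps exactly one tuple for each A-value. The function names are illustrative.

```python
from itertools import product


def r_n(n):
    """Instance r_n of Example 3 as a set of (A, B) pairs."""
    return {(f"a{i}", b) for i in range(1, n + 1) for b in ("b0", "b1")}


def repairs_under_fd(instance):
    """Repairs w.r.t. A -> B: pick one tuple from each group of equal A-values."""
    groups = {}
    for a, b in instance:
        groups.setdefault(a, []).append((a, b))
    return [set(choice) for choice in product(*groups.values())]


for n in range(1, 6):
    print(n, len(r_n(n)), len(repairs_under_fd(r_n(n))))
```

The second column grows linearly ($2n$ tuples) and the third exponentially ($2^n$ repairs).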



Given a set of denial constraints F and an instance r, all the repairs of r with respect to F can 
be succinctly represented as the conflict hypergraph. This is a generalization of the conflict graph 
defined in for FDs only. 

Definition 5 The conflict hypergraph $\mathcal{G}_{F,r}$ is a hypergraph whose set of vertices is the set of tuples
in $r$ and whose set of edges consists of all the sets $\{t_1, t_2, \ldots, t_l\}$ such that
$t_1, t_2, \ldots, t_l \in r$ and there is a constraint

$\forall \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_l.\ \neg[R(\bar{x}_1) \wedge R(\bar{x}_2) \wedge \ldots \wedge R(\bar{x}_l) \wedge \phi(\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_l)]$

in $F$ such that $t_1, t_2, \ldots, t_l$ violate together this constraint, which means that there exists a
substitution $\rho$ such that $\rho(\bar{x}_1) = t_1, \rho(\bar{x}_2) = t_2, \ldots, \rho(\bar{x}_l) = t_l$ and that
$\phi(t_1, t_2, \ldots, t_l)$ is true.

By an independent set in a hypergraph we mean a subset of its set of vertices which does not 
contain any edge. 

Proposition 2 Each repair of $r$ w.r.t. $F$ corresponds to a maximal independent set in $\mathcal{G}_{F,r}$.
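
The correspondence between repairs and maximal independent sets can be illustrated on a small
instance. In the sketch below a denial constraint is encoded, purely as an assumption of this
example, by a pair (arity, violates) where `violates` decides whether the given tuples jointly violate
the constraint. The enumeration of independent sets is brute force and exponential.

```python
from itertools import combinations, product


def conflict_hypergraph(r, constraints):
    """Edges of the conflict hypergraph: sets of tuples that jointly violate a constraint."""
    edges = set()
    for arity, violates in constraints:
        for combo in product(r, repeat=arity):
            if violates(*combo):
                edges.add(frozenset(combo))
    return edges


def maximal_independent_sets(r, edges):
    """Subsets of r that contain no edge and are maximal w.r.t. inclusion."""
    rows = sorted(r)
    independent = [
        frozenset(subset)
        for size in range(len(rows) + 1)
        for subset in combinations(rows, size)
        if not any(edge <= frozenset(subset) for edge in edges)
    ]
    return [s for s in independent if not any(s < t for t in independent)]


# FD A -> B on the instance r_2 of Example 3.
fd = (2, lambda s, t: s[0] == t[0] and s[1] != t[1])
r2 = {("a1", "b0"), ("a1", "b1"), ("a2", "b0"), ("a2", "b1")}
edges = conflict_hypergraph(r2, [fd])
print(len(edges), len(maximal_independent_sets(r2, edges)))
```

For $r_2$ the hypergraph has two edges (one per A-value) and four maximal independent sets, in
agreement with the $2^n$ repairs of Example 3.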

3.1 Positive results

Theorem 1 For every set $F$ of denial constraints and quantifier-free sentence $\Phi$, $D_{F,\Phi}$ is in PTIME.

Proof. We assume the sentence is in CNF, i.e., of the form $\Phi = \Phi_1 \wedge \Phi_2 \wedge \ldots \wedge \Phi_l$,
where each $\Phi_i$ is a disjunction of ground literals. $\Phi$ is true in every repair of $r$ if and only if
each of the clauses $\Phi_i$ is true in every repair. So it is enough to provide a polynomial algorithm
which will check if for a given ground clause true is a consistent answer.

It is easier to think that we are checking if for a ground clause true is not a consistent answer.
This means that we are checking whether there exists a repair $r'$ in which $\neg\Phi_i$ is true for some $i$.
But $\neg\Phi_i$ is of the form
$R(t_1) \wedge R(t_2) \wedge \ldots \wedge R(t_m) \wedge \neg R(t_{m+1}) \wedge \ldots \wedge \neg R(t_n)$, where the $t_j$
are tuples of constants. Thus it is enough to check two conditions:

1. whether for every $j$, $m+1 \leq j \leq n$, $t_j \notin r$ or there exists an edge $E_j \in \mathcal{G}_{F,r}$
such that $t_j \in E_j$, and

2. there is no edge $E \in \mathcal{G}_{F,r}$ such that $E \subseteq r'$, where

$r' = \{t_1, \ldots, t_m\} \cup \bigcup_{m+1 \leq j \leq n,\ t_j \in r} (E_j - \{t_j\})$.

If the conditions are satisfied, then a repair in which $\neg\Phi_i$ is true can be built by adding to $r'$
new tuples from $r$ until the set is maximal independent. The conditions can be checked by a
nondeterministic algorithm that needs $n - m$ nondeterministic steps, a number which is independent
of the size of the database, and in each of its nondeterministic steps selects one possibility from a set 
whose size is polynomial in the size of the database. So there is an equivalent PTIME deterministic 
algorithm. ■ 
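
The two conditions of the proof translate into a short deterministic procedure. The following
Python sketch is one possible reading, not the authors' implementation: it takes the conflict
hypergraph as a set of frozensets, tries every selection of edges $E_j$, and tests the resulting
candidate $r'$ for independence. The explicit check that the tuples required to be present belong to
$r$ is an assumption added here, justified by the fact that repairs are subsets of $r$.

```python
from itertools import product


def exists_repair(r, edges, present, absent):
    """Is there a repair containing every tuple in `present` and none in `absent`?"""
    if any(t not in r for t in present):
        return False  # added assumption: repairs are subsets of r
    in_r = [t for t in absent if t in r]
    # Condition 1: each excluded tuple that occurs in r needs an edge through it.
    options = [[edge for edge in edges if t in edge] for t in in_r]
    if any(not opts for opts in options):
        return False
    # Condition 2: some selection of edges E_j yields an independent candidate r'.
    for chosen in product(*options):
        candidate = set(present)
        for t, edge in zip(in_r, chosen):
            candidate |= edge - {t}
        if not any(edge <= candidate for edge in edges):
            return True
    return False


b115 = ("Brown", "Amherst", "115 Klein")
b120 = ("Brown", "Amherst", "120 Maple")
green = ("Green", "Clarence", "4000 Transit")
r = {b115, b120, green}
edges = {frozenset({b115, b120})}

# A clause is consistently true iff no repair makes its negation true.
disjunction_true = not exists_repair(r, edges, present=[], absent=[b115, b120])
single_true = not exists_repair(r, edges, present=[], absent=[b115])
print(disjunction_true, single_true)
```

On the instance of Example 1 the disjunction of the two Brown facts is consistently true, while
the single fact for 115 Klein is not. The loop ranges over at most $|E|^{n-m}$ selections, which is
polynomial in the size of the database for a fixed clause.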

Note that the above result holds also for constraints and queries involving more than one relation. 
The notion of conflict hypergraph needs to be appropriately generalized in this case. 

Theorem 2 Let $F$ consist of a single FD. Then for each sentence $Q$ of the form
$\exists \bar{t}[R(\bar{t}) \wedge \phi(\bar{t})]$ (where $\phi$ is quantifier-free and only built-in predicates occur
there), there exists a sentence $Q'$ such that for every database instance $r$, $r \models_F Q$ iff
$r \models Q'$. Consequently, $D_{F,Q}$ is in PTIME.

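Proof. The FD is $A_1 \ldots A_l \rightarrow A_{l+1} \ldots A_{l+m}$, where $l + m$ is not greater than the
arity $k$ of $R$. Let $\bar{x}$ be a vector of distinct variables of length $l$, $\bar{y}$ and $\bar{y}_1$ vectors of
distinct variables of length $m$, and $\bar{z}$, $\bar{z}_1$ and $\bar{z}_2$ vectors of distinct variables of length
$k - (l + m)$. Then, the query $Q'$ is as follows:

$\exists \bar{x}, \bar{y}, \bar{z}\ \forall \bar{y}_1, \bar{z}_1\ \exists \bar{z}_2 [R(\bar{x}, \bar{y}, \bar{z}) \wedge \phi(\bar{x}, \bar{y}, \bar{z}) \wedge [R(\bar{x}, \bar{y}_1, \bar{z}_1) \Rightarrow [R(\bar{x}, \bar{y}_1, \bar{z}_2) \wedge \phi(\bar{x}, \bar{y}_1, \bar{z}_2)]]]$.

■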
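
The rewritten query $Q'$ can be evaluated directly on the inconsistent instance, without computing
any repair. The sketch below is an illustration under two assumptions that are not in the text: the
left-hand side of the FD occupies the first $l$ attribute positions and the right-hand side the next
$m$, and $\phi$ is supplied as a Python predicate on whole tuples.

```python
def q_prime(r, l, m, phi):
    """Evaluate Q' on r for the FD A_1..A_l -> A_{l+1}..A_{l+m}."""

    def x(t):
        return t[:l]

    def xy(t):
        return t[:l + m]

    return any(
        phi(t)
        and all(
            # For every tuple t1 sharing x with t, some tuple with the same
            # x and y values as t1 must satisfy phi.
            any(xy(t2) == xy(t1) and phi(t2) for t2 in r)
            for t1 in r
            if x(t1) == x(t)
        )
        for t in r
    )


# Hypothetical instance over (A, B, C) with the FD A -> B, so l = m = 1.
r = {(1, "u", 5), (1, "v", 7), (2, "w", 9)}
print(q_prime(r, 1, 1, lambda t: t[2] > 4))
print(q_prime(r, 1, 1, lambda t: t[2] < 6))
```

Here the two repairs are obtained by dropping either (1, u, 5) or (1, v, 7). The condition C > 4 holds
for some tuple of every repair, whereas C < 6 fails in the repair that drops (1, u, 5).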

We show now that the above results are the strongest possible, since relaxing any of the restrictions
leads to co-NP-completeness. This is the case even though we limit ourselves to key FDs.



3.2 One key dependency, two query literals 

Theorem 3 There exist a key FD $f$ and a query $Q = \exists x, y, z[R(x, y, c) \wedge R(z, y, c')]$, for which
$D_{f,Q}$ is co-NP-data-complete.

Proof. Reduction from MONOTONE 3-SAT. The FD is $A \rightarrow BC$. Let
$\Phi = \phi_1 \wedge \ldots \wedge \phi_m \wedge \psi_{m+1} \wedge \ldots \wedge \psi_l$ be a conjunction of clauses, such that
all occurrences of variables in $\phi_i$ are positive and all occurrences of variables in $\psi_i$ are
negative. We build a database with the facts $R(i, p, c)$ if the variable $p$ occurs in the clause $\phi_i$
and $R(i, p, c')$ if the variable $p$ occurs in the clause $\psi_i$. Now, there is an assignment which
satisfies $\Phi$ if and only if there exists a repair of the database in which $Q$ is false. To show the
$\Rightarrow$ implication, select for each clause $\phi_i$ one variable $p_i$ which occurs in this clause and
whose value is 1 and for each clause $\psi_i$ one variable $p_i$ which occurs in $\psi_i$ and whose value is 0.
The set of facts $\{R(i, p_i, c) : i \leq m\} \cup \{R(i, p_i, c') : m + 1 \leq i \leq l\}$ is a repair in which
the query $Q$ is false. The $\Leftarrow$ implication is even simpler. ■
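
The reduction can be tried on small formulas. In the sketch below, clauses are given as non-empty
sets of variable names (an encoding chosen for this illustration), and the search for a repair that
falsifies $Q$ is brute force: under the key FD $A \rightarrow BC$ a repair keeps exactly one fact per
clause index.

```python
from itertools import product


def build_instance(positive_clauses, negative_clauses):
    """Facts R(i, p, c) for positive clauses and R(i, p, c') for negative ones."""
    facts = set()
    for i, clause in enumerate(positive_clauses, start=1):
        facts |= {(i, p, "c") for p in clause}
    offset = len(positive_clauses)
    for i, clause in enumerate(negative_clauses, start=offset + 1):
        facts |= {(i, p, "c'") for p in clause}
    return facts


def some_repair_falsifies_q(facts):
    """Is Q = exists x, y, z [R(x, y, c) and R(z, y, c')] false in some repair?"""
    groups = {}
    for fact in facts:
        groups.setdefault(fact[0], []).append(fact)
    for repair in product(*groups.values()):
        chosen_true = {p for _, p, tag in repair if tag == "c"}
        chosen_false = {p for _, p, tag in repair if tag == "c'"}
        if not chosen_true & chosen_false:
            return True
    return False


satisfiable = build_instance([{"p", "q"}], [{"p", "q"}])    # (p or q) and (not p or not q)
unsatisfiable = build_instance([{"p"}], [{"p"}])            # p and not p
print(some_repair_falsifies_q(satisfiable), some_repair_falsifies_q(unsatisfiable))
```

A falsifying repair selects, for each clause, a variable whose value makes the clause true, and no
variable is selected with both polarities; this is exactly a consistent partial assignment.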

3.3 Two key dependencies, one query literal 

By a bipartite edge-colored graph we mean a tuple $\mathcal{G} = (V, E, B, G)$ such that $(V, E)$ is an
undirected bipartite graph and $E = B \cup G$ for some given disjoint sets $B, G$ (so we think that each
of the edges of $\mathcal{G}$ has one of the two colors).

Definition 6 Let $\mathcal{G} = (V, E, B, G)$ be a bipartite edge-colored graph, and let $F \subseteq E$. We say
that $F$ is maximal V-free if:

1. $F$ is a maximal (w.r.t. inclusion) subset of $E$ with the property that neither $F(x, y) \wedge F(x, z)$
nor $F(x, y) \wedge F(z, y)$ holds for any $x, y, z$.

2. $F \cap B = \emptyset$.

We say that $\mathcal{G}$ has the max-V-free property if there exists $F$ which is maximal V-free.
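
Definition 6 can be tested by exhaustive search on very small graphs. The sketch below assumes that
every edge is written as a pair (left node, right node), which is possible because the graph is
bipartite, and that the two conditions are read for distinct edges. It is exponential and intended
only to make the definition concrete.

```python
from itertools import combinations


def is_v_free(edge_set):
    """No two distinct edges share their left endpoint or their right endpoint."""
    return all(a[0] != b[0] and a[1] != b[1] for a, b in combinations(edge_set, 2))


def has_max_v_free(green, blue):
    """Does a maximal V-free set F with no blue edge exist?"""
    blue = set(blue)
    all_edges = sorted(set(green) | blue)
    for size in range(len(all_edges) + 1):
        for f in combinations(all_edges, size):
            if not is_v_free(f) or any(edge in blue for edge in f):
                continue
            # Maximality is relative to all of E, blue edges included.
            if all(not is_v_free(f + (e,)) for e in all_edges if e not in f):
                return True
    return False


print(has_max_v_free(green=[("1", "a")], blue=[("1", "b")]))
print(has_max_v_free(green=[("1", "a")], blue=[("2", "b")]))
```

In the first graph the green edge blocks the blue one, so a maximal V-free set without blue edges
exists. In the second graph the blue edge conflicts with nothing and therefore belongs to every
maximal V-free set.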

Lemma 1 Max-V-free is an NP-complete property of bipartite edge-colored graphs. 

Proof. Reduction from 3-COLORABILITY. Let $\mathcal{H} = (U, D)$ be some undirected graph. This is how
we define the bipartite edge-colored graph $\mathcal{G}_{\mathcal{H}}$:

1. $V = \{v_\varepsilon, v'_\varepsilon : v \in U, \varepsilon \in \{m, n, r, g, b\}\}$, which means that there are 10 nodes
in the graph $\mathcal{G}_{\mathcal{H}}$ for each node of $\mathcal{H}$;

2. $G(v_m, v'_r), G(v_m, v'_b), G(v_n, v'_b), G(v_n, v'_g)$ and
$G(v_r, v'_m), G(v_b, v'_m), G(v_b, v'_n), G(v_g, v'_n)$ hold for each $v \in U$;

3. $B(v_\varepsilon, v'_{\varepsilon'})$ holds for each $v \in U$ and each pair $\varepsilon, \varepsilon' \in \{r, g, b\}$ such that
$\varepsilon \neq \varepsilon'$;

4. $B(v_\varepsilon, u'_\varepsilon)$ holds for each $\varepsilon \in \{r, g, b\}$ and each pair $u, v \in U$ such that $D(u, v)$.

Suppose that $\mathcal{H}$ is 3-colorable. We fix a coloring of $\mathcal{H}$ and construct the set $F$. For each
$v \in U$: if the color of $v$ is Red, then the edges $G(v_m, v'_b), G(v_n, v'_g)$ and
$G(v_b, v'_m), G(v_g, v'_n)$ are in $F$. If the color of $v$ is Green, then the edges
$G(v_m, v'_r), G(v_n, v'_b)$ and $G(v_r, v'_m), G(v_b, v'_n)$ are in $F$, and if the color of $v$ is Blue, then
the edges $G(v_m, v'_r), G(v_n, v'_g)$ and $G(v_r, v'_m), G(v_g, v'_n)$ are in $F$. It is easy
to see that the set $F$ constructed in this way is maximal V-free.

For the other direction, suppose that a maximal V-free set $F$ exists in $\mathcal{G}_{\mathcal{H}}$. Then, for each
$v \in U$ there is at least one node among $v_r, v_g, v_b$ which does not belong to any G-edge in $F$.
Let $v_\varepsilon$ be this node. Also, there is at least one such node (say, $v'_{\varepsilon'}$) among
$v'_r, v'_g, v'_b$. Now, it follows easily from the construction of $\mathcal{G}_{\mathcal{H}}$ that if $F$ is maximal
V-free then $\varepsilon = \varepsilon'$. Let this $\varepsilon$ be the color of $v$ in $\mathcal{H}$. It is easy to check that the
coloring defined in this way is a legal 3-coloring of $\mathcal{H}$. ■

Theorem 4 There is a set $F$ of 2 key dependencies and a query $Q = \exists x, y[R(x, y, b)]$, for which
$D_{F,Q}$ is co-NP-data-complete.

Proof. The 2 dependencies are $A \rightarrow BC$ and $B \rightarrow AC$. For a given bipartite edge-colored
graph $\mathcal{G} = (V, E, B, G)$ we build a database with the tuples $(x, y, g)$ if $G(x, y)$ holds in $\mathcal{G}$ and
$(x, y, b)$ if $B(x, y)$ holds in $\mathcal{G}$. Now the theorem follows from Lemma 1, since a repair in which the
query $\exists x, y\, R(x, y, b)$ is not true exists if and only if $\mathcal{G}$ has the max-V-free property. ■

3.4 One denial constraint 

By an edge-colored graph we mean a tuple $\mathcal{G} = (V, E, P, G, B)$ such that $(V, E)$ is a (directed)
graph and $E = P \cup G \cup B$ for some given pairwise disjoint sets $P, G, B$ (which we interpret as
colors). We say that the edge-colored graph $\mathcal{G}$ has the $\mathcal{Y}$-property if there are
$x, y, z, t \in V$ such that $E(x, y), E(y, z), E(y, t)$ hold and the edges $E(y, z)$ and $E(y, t)$ are of
different colors.

Definition 7 We say that the edge-colored graph $(V, E, P, G, B)$ has the max-$\mathcal{Y}$-free property if
there exists a subset $F$ of $E$ such that $F \cap P = \emptyset$ and:

1. $(V, F, P \cap F, G \cap F, B \cap F)$ does not have the $\mathcal{Y}$-property;

2. $F$ is a maximal (w.r.t. inclusion) subset of $E$ satisfying the first condition.

Lemma 2 Max-$\mathcal{Y}$-free is an NP-complete property of edge-colored graphs.

Proof. By a reduction of 3SAT. Let $\Phi = \phi_1 \wedge \phi_2 \wedge \ldots \wedge \phi_l$ be a conjunction of
clauses. Let $p_1, p_2, \ldots, p_n$ be all the variables in $\Phi$. This is how we define the edge-colored
graph $\mathcal{G}_\Phi$:

1. $V = \{a_i, b_i, c_i, d_i : 1 \leq i \leq n\} \cup \{e_j, f_j, g_j : 1 \leq j \leq l\}$, which means that there
are 3 nodes in the new graph for each clause in $\Phi$ and 4 nodes for each variable.

2. $P(a_i, b_i)$ and $P(e_j, f_j)$ hold for each suitable $i, j$;

3. $G(b_i, d_i)$ and $G(e_j, g_j)$ hold for each suitable $i, j$;

4. $B(b_i, c_i)$ holds for each suitable $i$;

5. $G(d_i, e_j)$ holds if $p_i$ occurs positively in $\phi_j$;

6. $B(d_i, e_j)$ holds if $p_i$ occurs negatively in $\phi_j$;

7. $E = B \cup G \cup P$.

Now suppose that $\Phi$ is satisfiable, and that $\mu$ is the satisfying assignment. We define the set
$F \subseteq E$ as follows. We keep in $F$ all the G-colored edges from item 3 above. If $\mu(p_i) = 1$ then
we keep in $F$ all the G edges leaving $d_i$ (item 5). Otherwise we keep in $F$ all the B edges leaving
$d_i$ (item 6). Obviously, $F \cap P = \emptyset$. It is also easy to see that $F$ does not have the
$\mathcal{Y}$-property and that it is maximal.

In the opposite direction, notice that if an $F$, as in Definition 7, does exist, then it must contain
all the G-edges from item 3 above; otherwise a P edge could be added without leading to the
$\mathcal{Y}$-property. But this means that, for each $i$, $F$ can either contain some (or all) of the B-edges
leaving $d_i$ or some (or all) of the G-edges. In this sense $F$ defines a valuation of variables. Also, if
$F$ is maximal, it must contain, for each $j$, at least one edge leading to $e_j$. But this means that the
defined valuation satisfies $\Phi$. ■

Theorem 5 There exist a denial constraint $f$ and a query of the form $Q = \exists x, y[R(x, y, p)]$, for
which $D_{f,Q}$ is co-NP-data-complete.

Proof. The denial constraint $f$ is:

$\forall x, y, z, w, s, s', s''.\ \neg[R(x, y, s) \wedge R(y, z, s') \wedge R(y, w, s'') \wedge s' \neq s'']$

For a given edge-colored graph $\mathcal{G} = (V, E, P, G, B)$ we build a database with the tuples $R(x, y, g)$
if $G(x, y)$ holds in $\mathcal{G}$, with $R(x, y, p)$ if $P(x, y)$ holds in $\mathcal{G}$ and with $R(x, y, b)$ if $B(x, y)$
holds in $\mathcal{G}$. Now the theorem follows from Lemma 2, since a repair in which the query $Q$ is not true
exists iff $\mathcal{G}$ has the max-$\mathcal{Y}$-free property. ■
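
The hyperedges induced by this denial constraint have up to three tuples, which is what separates
this case from the FD case. The sketch below computes them for a database built as in the proof;
the function name and the four-node graph are hypothetical and chosen only for illustration.

```python
def y_conflict_edges(r):
    """Hyperedges {t1, t2, t3} with t1 = (x, y, s), t2 = (y, z, s'), t3 = (y, w, s''), s' != s''."""
    return {
        frozenset((t1, t2, t3))
        for t1 in r
        for t2 in r
        for t3 in r
        if t1[1] == t2[0] == t3[0] and t2[2] != t3[2]
    }


# Hypothetical graph with edges P(a1, b1), G(b1, d1), B(b1, c1).
r = {("a1", "b1", "p"), ("b1", "d1", "g"), ("b1", "c1", "b")}
edges = y_conflict_edges(r)

# Dropping the P-colored tuple leaves a set that contains no hyperedge.
without_p = {t for t in r if t[2] != "p"}
print(len(edges), any(edge <= without_p for edge in edges))
```

The three tuples form a single hyperedge, so removing any one of them yields a repair. The repair
that removes the P-colored tuple falsifies $Q$ on this instance.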